02 - Example - Handling Duplicates and Missing Values

This notebook presents how to eliminate duplicates and solve the missing values.

By: Hugo Lopes
Learning Unit 08

Some inital imports:



In [1]:

    
import pandas as pd
import numpy as np
% matplotlib inline
from matplotlib import pyplot as plt

Load the dataset that will be used

Using the data from the previous unit (07-data-diagnostics)



In [2]:

    
data = pd.read_csv('../data/data_with_problems.csv', index_col=0)
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(15)









    



Our dataset has 3 columns (features) and 200 rows (people).






    Out[2]:






  
    
      
      age
      height
      gender
    
  
  
    
      CFLOXRHMDR
      88.0
      163.0
      female
    
    
      FXLJSNLSOG
      29.0
      158.0
      female
    
    
      FWDIVJKGOI
      42.0
      159.0
      female
    
    
      YWEBKQWHRE
      25.0
      179.0
      male
    
    
      YPUQAPSOYJ
      32.0
      169.0
      male
    
    
      YPUQAPSOYJ
      32.0
      169.0
      male
    
    
      YPUQAPSOYJ
      32.0
      169.0
      male
    
    
      YPUQAPSOYJ
      32.0
      169.0
      male
    
    
      SSZQEGTLNK
      NaN
      162.0
      male
    
    
      PRFEFXNGWN
      36.0
      166.0
      female
    
    
      IIVXDNOAIV
      1.0
      165.0
      female
    
    
      VVQYVNRAGQ
      18.0
      134.0
      female
    
    
      YVEDWPTEEB
      31.0
      149.0
      female
    
    
      LGPQPJXWPI
      34.0
      172.0
      female
    
    
      VURXKXJUTM
      22.0
      174.0
      female

Dealing with found issues

Time to deal with the issues previously found.

1) Duplicated data

Drop the duplicated rows (which have all column values the same), check the YPUQAPSOYJ row above. Let us use the drop_duplicates to help us with that by keeping only the first of the duplicated rows.



In [3]:

    
mask_duplicated = data.duplicated(keep='first')
mask_duplicated.head(10)









    Out[3]:





CFLOXRHMDR    False
FXLJSNLSOG    False
FWDIVJKGOI    False
YWEBKQWHRE    False
YPUQAPSOYJ    False
YPUQAPSOYJ     True
YPUQAPSOYJ     True
YPUQAPSOYJ     True
SSZQEGTLNK    False
PRFEFXNGWN    False
dtype: bool



In [4]:

    
data = data.drop_duplicates(keep='first')
print('Our dataset has now %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(10)









    



Our dataset has now 3 columns (features) and 194 rows (people).






    Out[4]:






  
    
      
      age
      height
      gender
    
  
  
    
      CFLOXRHMDR
      88.0
      163.0
      female
    
    
      FXLJSNLSOG
      29.0
      158.0
      female
    
    
      FWDIVJKGOI
      42.0
      159.0
      female
    
    
      YWEBKQWHRE
      25.0
      179.0
      male
    
    
      YPUQAPSOYJ
      32.0
      169.0
      male
    
    
      SSZQEGTLNK
      NaN
      162.0
      male
    
    
      PRFEFXNGWN
      36.0
      166.0
      female
    
    
      IIVXDNOAIV
      1.0
      165.0
      female
    
    
      VVQYVNRAGQ
      18.0
      134.0
      female
    
    
      YVEDWPTEEB
      31.0
      149.0
      female

You could also consider a duplicate a row with the same index and same age only by setting data.drop_duplicates(subset=['age'], keep='first'), but in our case it would lead to the same result. Note that in general it is not a recommended programming practice to use the argument 'inplace=True' (e.g., data.drop_duplicates(subset=['age'], keep='first', inplace=True)) --> may lead to unnexpected results.

2) Missing Values

This one of the major, if not the biggest, data problems that we faced with. There are several ways to deal with them, e.g.:

drop rows which contain missing values.
fill missing values with zero.
fill missing values with mean of the column the missing value is located.
(more advanced) use decision trees to predict the missing values.



In [5]:

    
missing_data = data.isnull()
print('Number of missing values (NaN) per column/feature:')
print(missing_data.sum())
print('And we currently have %d rows.' % data.shape[0])









    



Number of missing values (NaN) per column/feature:
age       9
height    4
gender    9
dtype: int64
And we currently have 194 rows.

That is not terrible to the point of fully dropping a column/feature due to the amount of missing values. Nevertheless, the action to do that would be data.drop('age', axis=1). The missing_data variable is our mask for the missing values:



In [6]:

    
missing_data.head(8)









    Out[6]:






  
    
      
      age
      height
      gender
    
  
  
    
      CFLOXRHMDR
      False
      False
      False
    
    
      FXLJSNLSOG
      False
      False
      False
    
    
      FWDIVJKGOI
      False
      False
      False
    
    
      YWEBKQWHRE
      False
      False
      False
    
    
      YPUQAPSOYJ
      False
      False
      False
    
    
      SSZQEGTLNK
      True
      False
      False
    
    
      PRFEFXNGWN
      False
      False
      False
    
    
      IIVXDNOAIV
      False
      False
      False

Drop rows with missing values

This can be done with dropna(), for instance:



In [7]:

    
data_aux = data.dropna(how='any')
print('Dataset now with %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))









    



Dataset now with 3 columns (features) and 173 rows (people).

Fill missing values with a specific value (e.g., 0)

This can be done with fillna(), for instance:



In [8]:

    
data_aux = data.fillna(value=0)
print('Dataset has %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))









    



Dataset has 3 columns (features) and 194 rows (people).

So, what happened with our dataset? Let's take a look where we had missing values before:



In [9]:

    
data_aux[missing_data['age']]









    Out[9]:






  
    
      
      age
      height
      gender
    
  
  
    
      SSZQEGTLNK
      0.0
      162.0
      male
    
    
      TJQPFEFVVH
      0.0
      182.0
      0
    
    
      PYHWLDVICX
      0.0
      181.0
      female
    
    
      MLRPKGKACD
      0.0
      185.0
      male
    
    
      SGMGUJEBNC
      0.0
      173.0
      MALE
    
    
      YZDOYNOXAF
      0.0
      144.0
      female
    
    
      UAOAMGUQSX
      0.0
      144.0
      male
    
    
      JFVZOEGUUA
      0.0
      208.0
      female
    
    
      VYAQBLJKXJ
      0.0
      165.0
      male



In [10]:

    
data_aux[missing_data['height']]









    Out[10]:






  
    
      
      age
      height
      gender
    
  
  
    
      CWCFROPRFE
      22.0
      0.0
      male
    
    
      EORSIPDIHA
      21.0
      0.0
      MALE
    
    
      NGJOHICWSY
      41.0
      0.0
      male
    
    
      LNLAPFIJEQ
      37.0
      0.0
      male



In [11]:

    
data_aux[missing_data['gender']]









    Out[11]:






  
    
      
      age
      height
      gender
    
  
  
    
      TJQPFEFVVH
      0.0
      182.0
      0
    
    
      QXUGUHCOPT
      101.0
      196.0
      0
    
    
      LKEHZFGGTS
      49.0
      177.0
      0
    
    
      EBTRPEDHJS
      43.0
      147.0
      0
    
    
      BDFQWIHWCH
      27.0
      167.0
      0
    
    
      NUCCGRJLXN
      20.0
      159.0
      0
    
    
      GQSNBZIGBL
      27.0
      197.0
      0
    
    
      KWJJMPVSCP
      24.0
      189.0
      0
    
    
      LMZUTCGFYT
      21.0
      153.0
      0

Looks like what we did was not the most appropriate. For instance, we create a new category in the gender column:



In [12]:

    
data_aux['gender'].value_counts()









    Out[12]:





female    111
male       65
MALE        9
0           9
Name: gender, dtype: int64

Fill missing values with mean/mode/median:

This is one of the most common approaches.

`height` - filling missing values with the mean



In [13]:

    
data['height'] = data['height'].replace(np.nan, data['height'].mean())
data[missing_data['height']]









    Out[13]:






  
    
      
      age
      height
      gender
    
  
  
    
      CWCFROPRFE
      22.0
      168.847368
      male
    
    
      EORSIPDIHA
      21.0
      168.847368
      MALE
    
    
      NGJOHICWSY
      41.0
      168.847368
      male
    
    
      LNLAPFIJEQ
      37.0
      168.847368
      male

age - filling missing values with median



In [14]:

    
data.loc[missing_data['age'], 'age'] = data['age'].median()
data[missing_data['age']]









    Out[14]:






  
    
      
      age
      height
      gender
    
  
  
    
      SSZQEGTLNK
      27.0
      162.0
      male
    
    
      TJQPFEFVVH
      27.0
      182.0
      NaN
    
    
      PYHWLDVICX
      27.0
      181.0
      female
    
    
      MLRPKGKACD
      27.0
      185.0
      male
    
    
      SGMGUJEBNC
      27.0
      173.0
      MALE
    
    
      YZDOYNOXAF
      27.0
      144.0
      female
    
    
      UAOAMGUQSX
      27.0
      144.0
      male
    
    
      JFVZOEGUUA
      27.0
      208.0
      female
    
    
      VYAQBLJKXJ
      27.0
      165.0
      male

gender - filling missing values with mode

Remember we had a small problem with the data of this feature (the MALE word instead of male)? Typing problems are very common and they can be hidden problems. That's why it is so important to take a look at the data.



In [15]:

    
data['gender'].value_counts(dropna=False)









    Out[15]:





female    111
male       65
MALE        9
NaN         9
Name: gender, dtype: int64

Let's replace MALE by male to harmonize our feature.



In [16]:

    
mask = data['gender'] == 'MALE'
data.loc[mask, 'gender'] = 'male'
# validate we don't have MALE:
data['gender'].value_counts(dropna=False)









    Out[16]:





female    111
male       74
NaN         9
Name: gender, dtype: int64

Now we don't have the MALE entry anymore. Let us fill the missing values with the mode:



In [17]:

    
the_mode = data['gender'].mode()
# note that mode() return a dataframe
the_mode









    Out[17]:





0    female
dtype: object



In [18]:

    
data['gender'] = data['gender'].replace(np.nan, data['gender'].mode()[0])
data[missing_data['gender']]









    Out[18]:






  
    
      
      age
      height
      gender
    
  
  
    
      TJQPFEFVVH
      27.0
      182.0
      female
    
    
      QXUGUHCOPT
      101.0
      196.0
      female
    
    
      LKEHZFGGTS
      49.0
      177.0
      female
    
    
      EBTRPEDHJS
      43.0
      147.0
      female
    
    
      BDFQWIHWCH
      27.0
      167.0
      female
    
    
      NUCCGRJLXN
      20.0
      159.0
      female
    
    
      GQSNBZIGBL
      27.0
      197.0
      female
    
    
      KWJJMPVSCP
      24.0
      189.0
      female
    
    
      LMZUTCGFYT
      21.0
      153.0
      female

Final check

Always a good idea...



In [19]:

    
data.isnull().sum()









    Out[19]:





age       0
height    0
gender    0
dtype: int64

	age	height	gender
CFLOXRHMDR	88.0	163.0	female
FXLJSNLSOG	29.0	158.0	female
FWDIVJKGOI	42.0	159.0	female
YWEBKQWHRE	25.0	179.0	male
YPUQAPSOYJ	32.0	169.0	male
YPUQAPSOYJ	32.0	169.0	male
YPUQAPSOYJ	32.0	169.0	male
YPUQAPSOYJ	32.0	169.0	male
SSZQEGTLNK	NaN	162.0	male
PRFEFXNGWN	36.0	166.0	female
IIVXDNOAIV	1.0	165.0	female
VVQYVNRAGQ	18.0	134.0	female
YVEDWPTEEB	31.0	149.0	female
LGPQPJXWPI	34.0	172.0	female
VURXKXJUTM	22.0	174.0	female

	age	height	gender
CFLOXRHMDR	False	False	False
FXLJSNLSOG	False	False	False
FWDIVJKGOI	False	False	False
YWEBKQWHRE	False	False	False
YPUQAPSOYJ	False	False	False
SSZQEGTLNK	True	False	False
PRFEFXNGWN	False	False	False
IIVXDNOAIV	False	False	False

	age	gender
CWCFROPRFE	22.0	male
EORSIPDIHA	21.0	MALE
NGJOHICWSY	41.0	male
LNLAPFIJEQ	37.0	male

	age	height
TJQPFEFVVH	0.0	182.0
QXUGUHCOPT	101.0	196.0
LKEHZFGGTS	49.0	177.0
EBTRPEDHJS	43.0	147.0
BDFQWIHWCH	27.0	167.0
NUCCGRJLXN	20.0	159.0
GQSNBZIGBL	27.0	197.0
KWJJMPVSCP	24.0	189.0
LMZUTCGFYT	21.0	153.0

	age	height	gender
CWCFROPRFE	22.0	168.847368	male
EORSIPDIHA	21.0	168.847368	MALE
NGJOHICWSY	41.0	168.847368	male
LNLAPFIJEQ	37.0	168.847368	male